A Case For Lightweight Dynamic Event Based Monitoring and Management Support For Large Scale DataCenters
Abstract
Future large-scale systems, such as cloud datacenters, will soon reach millions of cores as core counts increase. This poses monitoring and management challenges not yet met by existing experimental or commercial software systems. At the core of the challenge is the need to run continuous and on-demand monitoring queries over aggregated data derived from distributed monitoring data streams. Of particular importance is the ability to quickly detect, correlate, and react to system issues. Our research is developing monitoring and management methods and infrastructure that scale while exhibiting small lag. Our approach is to use event-based system design and to distribute monitoring, select data aggregation and analysis, and actuation across datacenter subsystems and machines. Finally, by embedding management into the underlying system and management infrastructure, based on modern support for virtualization, management is separated from applications, so it can be used and changed without affecting user code.

1. PROBLEM DESCRIPTION

Existing monitoring and management systems, which rely on a single or multiple hierarchies for monitoring-data aggregation and on centralized analysis and coordination, are not designed to scale to future large-scale systems. Some of the key challenges for monitoring and management systems are:

1. The ability to proactively detect, correlate, and react to system issues.
2. The ability to change part of the monitoring data path in response to changes in monitors, systems, subsystems, or platforms.
3. Small lag in reacting to system issues caused by external or internal factors affecting the systems.
4. A global and hierarchical correlated view of the information at an acceptable overhead and within an allowable precision.
5. Distributed coordination among the various hierarchies of aggregated data, monitors, and actuators across the environment.

For the large-scale systems our research targets, such as future datacenters, several key trends are shaping future technologies.

First, there is an inexorable move toward many-core chips, which are increasingly composed of both general-purpose cores and cores specialized for certain tasks. Coupled with increased blade server densities and hardware disaggregation, this results in a number of end systems and a degree of heterogeneity that make it imperative to integrate facilities for online and automated management intimately into such systems.

Second, increased demand for ever larger and more reactive datacenters, driven in part by cloud computing, will lead to scales of millions of cores, making it critically important for automated management methods to scale. It also implies that these methods must exhibit basic properties that include extensibility, the ability to interact with diverse management subsystems, and robustness.

Third, virtualization is becoming a necessary element of any study in systems management, in part because of its known benefits, such as server consolidation in datacenter environments.

Thus, the two most important categories of system issues for monitoring and management systems that our research focuses on in future datacenter environments are:

1. Dynamism. With the growing complexity of groups, applications, and services, it is particularly important for datacenter management to quickly detect and react to system issues, in order to mitigate and contain effects deleterious to application performance or datacenter health. The monitoring and management infrastructure should exhibit small lag under the changes that occur in a dynamic, distributed, virtualized datacenter environment.

2. Scalability. Traditional approaches that use centralized and reactive techniques for monitoring, data aggregation and analysis, and actuation across datacenter subsystems and machines will not be able to scale to millions of cores. Each of these components will have to be distributed, and the sheer scale therefore introduces the problem of coordinated distributed decision making across the datacenter environment. The monitoring and management mechanisms need to scale in and scale out efficiently.
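
To make the idea of distributed, event-based aggregation and actuation more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical aggregation tree (host to rack to datacenter) in which each node keeps the latest value per source, reacts locally when the mean crosses a threshold, and forwards only its aggregate upward. The names `Event`, `Aggregator`, `on_event`, `threshold`, and `actuate` are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Event:
    """A monitoring event (all fields are hypothetical, for illustration)."""
    source: str   # e.g. "rack0/host0"
    metric: str   # e.g. "cpu_util"
    value: float


@dataclass
class Aggregator:
    """One node in an assumed aggregation tree (host -> rack -> datacenter)."""
    name: str
    threshold: float
    parent: Optional["Aggregator"] = None
    actuate: Optional[Callable[[str, float], None]] = None
    latest: Dict[str, float] = field(default_factory=dict)

    def on_event(self, event: Event) -> None:
        # Keep only the most recent value reported by each source.
        self.latest[event.source] = event.value
        mean = sum(self.latest.values()) / len(self.latest)

        # React locally, close to the data, to keep reaction lag small.
        if mean > self.threshold and self.actuate is not None:
            self.actuate(self.name, mean)

        # Forward one aggregated event upward instead of the raw stream,
        # so higher levels see less traffic as the system grows.
        if self.parent is not None:
            self.parent.on_event(Event(self.name, event.metric, mean))


if __name__ == "__main__":
    def log_action(node: str, mean: float) -> None:
        print(f"actuate at {node}: mean={mean:.2f}")

    dc = Aggregator("dc", threshold=0.9, actuate=log_action)
    rack0 = Aggregator("rack0", threshold=0.8, parent=dc, actuate=log_action)
    rack1 = Aggregator("rack1", threshold=0.8, parent=dc, actuate=log_action)

    rack0.on_event(Event("rack0/host0", "cpu_util", 0.95))
    rack1.on_event(Event("rack1/host0", "cpu_util", 0.10))
```

In this sketch each level decides on its own whether to act, which loosely mirrors the distributed analysis and actuation described above. The coordination between levels that the text identifies as an open problem is deliberately left out.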
Publication date: 2009